Naturaren kume ezkutuak
In this article, black holes are our subject of study. Since general relativity is essential to properly understand these peculiar astronomical objects, the first pages of the article review the main ideas of that theory. Next, the death of stars is discussed. We will see that the largest stars become black holes. Several surprising properties of black holes are also explored in the article. In that section, black hole evaporation takes centre stage: after explaining some basic concepts of quantum mechanics, it is shown that black holes are not so black after all. Finally, the reader is immersed in the dark world of singularities. We will try to follow the few rays of light that exist there, and, for the dreamers, some details about time machines are also discussed.
Egocentric Vision-based Action Recognition: A survey
[EN] The egocentric action recognition (EAR) field has recently grown in popularity due to the affordable and lightweight wearable cameras now available, such as GoPro and similar devices. As a result, the amount of egocentric data being generated has increased, triggering interest in the understanding of egocentric videos. More specifically, the recognition of actions in egocentric videos has gained popularity due to the challenge that it poses: the wild movement of the camera and the lack of context make it hard to recognise actions with a performance similar to that of third-person vision solutions. This has ignited research interest in the field and, nowadays, many public datasets and competitions can be found in both the machine learning and the computer vision communities. In this survey, we aim to analyse the literature on egocentric vision methods and algorithms. For that, we propose a taxonomy to divide the literature into various categories with subcategories, contributing a more fine-grained classification of the available methods. We also provide a review of the zero-shot approaches used by the EAR community, a methodology that could help to transfer EAR algorithms to real-world applications. Finally, we summarise the datasets used by researchers in the literature. We gratefully acknowledge the support of the Basque Government's Department of Education for the predoctoral funding of the first author. This work has been supported by the Spanish Government under the FuturAAL-Context project (RTI2018-101045-B-C21) and by the Basque Government under the Deustek project (IT-1078-16-D).
Combining Users' Activity Survey and Simulators to Evaluate Human Activity Recognition Systems
Evaluating human activity recognition systems usually implies following expensive and time-consuming methodologies, where experiments with humans are run with the consequent ethical and legal issues. To overcome these problems, we propose a novel evaluation methodology based on user surveys and a synthetic dataset generator tool. Surveys capture how different users perform activities of daily living, while the synthetic dataset generator is used to create properly labelled activity datasets modelled on the information extracted from the surveys. Important aspects, such as sensor noise, varying time lapses and erratic user behaviour, can also be simulated using the tool. The proposed methodology is shown to have very important advantages that allow researchers to carry out their work more efficiently. To evaluate the approach, a synthetic dataset generated following the proposed methodology is compared to a real dataset by computing the similarity between sensor occurrence frequencies. It is concluded that the similarity between the two datasets is highly significant.
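The abstract does not specify how the similarity between sensor occurrence frequencies is computed; a minimal sketch of one plausible comparison is given below, using cosine similarity over per-sensor relative frequencies (the metric choice, event format and sensor names are illustrative assumptions, not the paper's actual procedure):

from collections import Counter

def sensor_frequencies(events):
    """Relative occurrence frequency of each sensor in a list of (sensor_id, timestamp) events."""
    counts = Counter(sensor for sensor, _ in events)
    total = sum(counts.values())
    return {sensor: n / total for sensor, n in counts.items()}

def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two sensor-frequency dictionaries (1.0 = identical profiles)."""
    sensors = set(freq_a) | set(freq_b)
    dot = sum(freq_a.get(s, 0.0) * freq_b.get(s, 0.0) for s in sensors)
    norm_a = sum(v * v for v in freq_a.values()) ** 0.5
    norm_b = sum(v * v for v in freq_b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical usage: compare a simulator-generated dataset against a real one.
real_events = [("kitchen_motion", 1), ("fridge_door", 2), ("kitchen_motion", 3)]
synthetic_events = [("kitchen_motion", 1), ("fridge_door", 2), ("fridge_door", 3)]
print(cosine_similarity(sensor_frequencies(real_events), sensor_frequencies(synthetic_events)))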
Embedding-based real-time change point detection with application to activity segmentation in smart home time series data
[EN] Human activity recognition systems are essential to enable many assistive applications. These systems can be sensor-based or vision-based. When sensor-based systems are deployed in real environments, they must segment sensor data streams on the fly in order to extract features and recognize the ongoing activities. This segmentation can be done with different approaches. One effective approach is to employ change point detection (CPD) algorithms to detect activity transitions (i.e., determining when activities start and end). In this paper, we present a novel real-time CPD method to perform activity segmentation, where neural embeddings (vectors of continuous numbers) are used to represent sensor events. Through empirical evaluation with 3 publicly available benchmark datasets, we conclude that our method is useful for segmenting sensor data, offering significantly better performance than state-of-the-art algorithms on two of them. In addition, we propose the use of retrofitting, a graph-based technique, to adjust the embeddings and introduce expert knowledge into the activity segmentation task, showing empirically that it can improve the performance of our method using three graphs generated from two sources of information. Finally, we discuss the advantages of our approach regarding computational cost, manual effort reduction (no need for hand-crafted features) and cross-environment possibilities (transfer learning) compared to other approaches. This work was carried out with the financial support of FuturAALEgo (RTI2018-101045-A-C22), granted by the Spanish Ministry of Science, Innovation and Universities.
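The abstract does not give the exact change point criterion; the rough sketch below illustrates the general idea of detecting activity transitions from neural embeddings of sensor events, flagging a change point when consecutive windows of embeddings drift apart in cosine distance (the window size, threshold and distance test are assumptions for illustration, not the method evaluated in the paper):

import numpy as np

def detect_change_points(embeddings, window=10, threshold=0.35):
    """Flag a change point when the mean embedding of the upcoming window
    drifts away (in cosine distance) from the mean of the previous window.

    embeddings: array of shape (n_events, dim), one vector per sensor event.
    Returns the indices of detected activity transitions."""
    change_points = []
    for i in range(window, len(embeddings) - window):
        prev_mean = embeddings[i - window:i].mean(axis=0)
        next_mean = embeddings[i:i + window].mean(axis=0)
        cos = np.dot(prev_mean, next_mean) / (
            np.linalg.norm(prev_mean) * np.linalg.norm(next_mean) + 1e-9)
        if 1.0 - cos > threshold:
            change_points.append(i)
    return change_points

In a streaming setting the same test would be applied as each new event arrives, which is what makes this family of methods usable in real time.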
Do Multilingual Language Models Think Better in English?
Translate-test is a popular technique to improve the performance of multilingual language models. This approach works by translating the input into English using an external machine translation system and running inference over the translated input. However, these improvements can be attributed to the use of a separate translation system, which is typically trained on large amounts of parallel data not seen by the language model. In this work, we introduce a new approach called self-translate, which overcomes the need for an external translation system by leveraging the few-shot translation capabilities of multilingual language models. Experiments over 5 tasks show that self-translate consistently outperforms direct inference, demonstrating that language models are unable to leverage their full multilingual potential when prompted in non-English languages. Our code is available at https://github.com/juletx/self-translate
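As described, self-translate first asks the model itself to translate the input into English via few-shot prompting and then runs inference over the translation. A minimal sketch of that two-step pipeline is given below, assuming Hugging Face transformers generation calls; the model choice, prompt templates and task format are illustrative only, and the linked repository contains the authors' actual implementation:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # illustrative multilingual LM, not the paper's choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate(prompt, max_new_tokens=64):
    """Greedy continuation of a prompt, returning only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def self_translate_infer(non_english_input, task_prompt):
    # Step 1: few-shot translation into English using the same model (toy one-shot example).
    translation_prompt = (
        "Basque: Kaixo, zer moduz?\nEnglish: Hello, how are you?\n"
        f"Basque: {non_english_input}\nEnglish:"
    )
    english_input = generate(translation_prompt).split("\n")[0].strip()
    # Step 2: run the actual task over the translated input instead of the original.
    return generate(task_prompt.format(input=english_input))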
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset
Existing work has observed that current text-to-image systems do not accurately reflect explicit spatial relations between objects such as 'left of' or 'below'. We hypothesize that this is because explicit spatial relations rarely appear in the image captions used to train these models. We propose an automatic method that, given existing images, generates synthetic captions that contain 14 explicit spatial relations. We introduce the Spatial Relation for Generation (SR4G) dataset, which contains 9.9 million image-caption pairs for training, and more than 60 thousand captions for evaluation. In order to test generalization, we also provide an 'unseen' split, where the sets of objects in the train and test captions are disjoint. SR4G is the first dataset that can be used to spatially fine-tune text-to-image systems. We show that fine-tuning two different Stable Diffusion models (denoted as SD) yields improvements of up to 9 points in the VISOR metric. The improvement holds in the 'unseen' split, showing that SD is able to generalize to unseen objects. SD improves the state of the art with fewer parameters and avoids complex architectures. Our analysis shows that the improvement is consistent across all relations. The dataset and the code will be publicly available.
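The abstract describes deriving synthetic captions with explicit spatial relations from existing images; the toy sketch below shows how one such caption could be produced from two annotated bounding boxes by comparing their centres (the relation wording, box format and decision rule are assumptions, not the SR4G generation rules):

def spatial_caption(obj_a, obj_b):
    """Return a synthetic caption with an explicit spatial relation.

    obj_a, obj_b: (label, (x_min, y_min, x_max, y_max)) in image coordinates,
    where y grows downwards."""
    (label_a, box_a), (label_b, box_b) = obj_a, obj_b
    cx_a, cy_a = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cx_b, cy_b = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    if abs(cx_a - cx_b) >= abs(cy_a - cy_b):
        relation = "to the left of" if cx_a < cx_b else "to the right of"
    else:
        relation = "above" if cy_a < cy_b else "below"
    return f"a {label_a} {relation} a {label_b}"

# Hypothetical usage with two annotated objects:
print(spatial_caption(("dog", (10, 40, 60, 90)), ("bench", (120, 35, 220, 95))))
# -> "a dog to the left of a bench"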
Image captioning for effective use of language models in knowledge-based visual question answering
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose to use a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task which requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models of a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective on a standard VQA task (VQA 2.0); and (iii) our method attains state-of-the-art results when increasing the size of the language model. We also significantly outperform current multimodal systems, even when they are augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks. Ander is funded by a PhD grant from the Basque Government (PRE_2021_2_0143). This work is based upon work partially supported by the Ministry of Science and Innovation of the Spanish Government (DeepKnowledge project PID2021-127777OB-C21) and the Basque Government (IXA excellence research group IT1570-22).
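The verbalise-then-reason pipeline described above, in which an off-the-shelf captioner turns the image into text and a text-only language model answers from that caption plus its own implicit knowledge, can be sketched roughly as follows; the captioning and language models named here are illustrative stand-ins, not the paper's configuration:

from transformers import pipeline

# Off-the-shelf captioner used to verbalise the image (illustrative model choice).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
# Text-only language model that answers from the caption and its own world knowledge.
qa_model = pipeline("text-generation", model="gpt2")

def answer_question(image_path, question):
    """Caption the image, build a textual prompt, and let the language model answer."""
    caption = captioner(image_path)[0]["generated_text"]
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    output = qa_model(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"]
    return output[len(prompt):].strip()

The point of the design, as the abstract argues, is that once the image is expressed as text, the quality of the answer depends mostly on the language model's implicit knowledge, so scaling up that model improves knowledge-intensive VQA even without multimodal training.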